Cross-Lingual Genre Classification for Closely Related Languages

نویسندگان

  • Dirk Snyman
  • Gerhard B. van Huyssteen
  • Walter Daelemans
چکیده

Resource-scarcity is a topic that is continually researched by the HLT community, especially for the SouthAfrican context. We explore the possibility of leveraging existing resources to help facilitate the development of new resources for under-resourced languages by using cross-lingual classification methods. We investigate the application of an Afrikaans genre classification system on Dutch texts and see encouraging results of 63.1% when classifying raw Dutch texts. We attempt to optimise the performance by employing a machine translation pre-processing step, boosting performance of the Afrikaans system on Dutch data to 67.2%. Further investigation is required as we conclude that the robustness of the Afrikaans genre classification system needs improvement.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Robust Cross-Lingual Genre Classification through Comparable Corpora

Classification of texts by genre can benefit applications in Natural Language Processing and Information Retrieval. However, a mono-lingual approach requires large amounts of labeled texts in the target language. Work reported here shows that the benefits of genre classification can be extended to other languages through cross-lingual methods. Comparable corpora – here taken to be collections o...

متن کامل

Label Propagation for Fine-Grained Cross-Lingual Genre Classification

Cross-lingual methods can bring the benefits of genre classification to languages which lack genre-annotated training data. However, prior work in this field has been evaluated on coarse genres only. To predict fine-grained genres across languages, we propose a label propagation method, which combines separate sets of features. The results are promising, as the approach outperforms most baselin...

متن کامل

The 5th Workshop on Building and Using Comparable Corpora

Classification of texts by genre can benefit applications in Natural Language Processing and Information Retrieval. However, a mono-lingual approach requires large amounts of labeled texts in the target language. Work reported here shows that the benefits of genre classification can be extended to other languages through cross-lingual methods. Comparable corpora – here taken to be collections o...

متن کامل

Cross-Lingual Genre Classification

Classifying text genres across languages can bring the benefits of genre classification to the target language without the costs of manual annotation. This article introduces the first approach to this task, which exploits text features that can be considered stable genre predictors across languages. My experiments show this method to perform equally well or better than full text translation co...

متن کامل

Cross-lingual porting of distributional semantic classification

This article presents experiments in the porting of semantic classification between two closely related languages, Swedish and Danish. We show that a classifier for the semantic property of animacy, trained on morphosyntactic distributional data for one language may be applied directly to data from another language with little loss in terms of accuracy.

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012